0. Distribution of work and initialisation

| Work | David Ari Ostenfeldt, s194237 | Kristian Rhindal Møllman, s194246 | Kristoffer Marboe, s194249 |
|---|---|---|---|
| Data | 40% | 30% | 30% |
| Networks | 30% | 40% | 30% |
| Text | 30% | 30% | 40% |
| Website | 33% | 33% | 33% |
| Explainer notebook | 33% | 33% | 33% |

Everyone contributed equally to this project.

1. Motivation

What is your dataset?

The dataset we will be analysing is a collection of songs, each with the artists that worked on them, the lyrics, and the release date.

The network will be created with each artist as a node, with a link between two artists if they have collaborated on a song.

The text analysis will be conducted on the lyrics of all the songs gathered.

Why did you choose this dataset?

Musicians tend to collaborate, which we thought would make for an interesting network. Furthermore, investigating the different artists' language through their song lyrics to find patterns and attributes would be fun.

What was your goal for the end user's experience?

We wanted to provide some insight into how artists collaborate, which genres and artists collaborate more, and how the language differs between genres and artists. Furthermore, by providing the data set to the user, we also let them play around with it on their own, to investigate other genres or, e.g., look at how a specific artist has developed through the years.

Scraping the data

The first part of any project is collecting the data. We needed a list of songs to collect from Genius, and for this purpose we chose Billboard's 'The Hot 100' list. The list goes all the way back to 1958 and updates every week. In theory, that should grant us 100 songs a week × 52 weeks = 5,200 songs a year over 62 years, which means 322,400 possible songs.

To collect the list of songs we used the billboard.py module, a Python interface to Billboard's charts.
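As an illustrative sketch (not the project's actual code), the weekly chart dates to fetch could be generated like this; the `billboard.ChartData('hot-100', date=...)` call in the comment is the module's entry point:

```python
from datetime import date, timedelta

def weekly_chart_dates(start, end):
    """Yield the weekly 'YYYY-MM-DD' chart dates between two dates (inclusive)."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(weeks=1)

# the first Hot 100 was published on 1958-08-04; each chart would then be
# fetched with e.g. billboard.ChartData('hot-100', date=dates[0]) (not run here)
dates = list(weekly_chart_dates(date(1958, 8, 4), date(1959, 8, 3)))
print(len(dates))  # 53
```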

Note: The code in this section is not meant to be run; it is simply there to show how we collected the data.

First we create some helper functions, that we will make use of when searching for songs.

The find_artist function takes a name and returns an artist.
find_song takes an artist and a song title and returns a song.
artist_to_list returns a list of artists.
process_artist_names uses regex to find all the separate artists in the given name segment.

When searching for songs using the Genius API, we used a sequential search strategy. We first search for the song title together with the full artist name. If that yields no results, we split the artist name at 'feature', 'feat.', 'ft.' or 'with' and search for the song title with the first part of the artist name. If this still doesn't return a valid song, we remove parentheses from the artist name and replace 'and' with '&', then search again. If this fails as well, we try splitting the modified artist name at '&' and ',' and search once more. If none of these steps result in a valid song, we simply search for the song title alone and hope for the best.
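The fallback order can be sketched as a generator of candidate queries. This is a simplified illustration; the function name and exact patterns are ours, not the project code:

```python
import re

def candidate_queries(title, artist):
    """Yield (title, artist) queries in the fallback order described above."""
    yield title, artist                                    # 1. full artist name
    head = re.split(r'\s+(?:featuring|feature|feat\.|ft\.|with)\s+', artist, flags=re.I)[0]
    if head != artist:
        yield title, head                                  # 2. drop the featured part
    cleaned = re.sub(r'\([^)]*\)', '', head).replace(' and ', ' & ').strip()
    if cleaned != head:
        yield title, cleaned                               # 3. strip parentheses, 'and' -> '&'
    for part in re.split(r'\s*[&,]\s*', cleaned):
        if part and part != cleaned:
            yield title, part                              # 4. split at '&' and ','
    yield title, ''                                        # 5. title only, hope for the best

print(list(candidate_queries('Sweetest Pie', 'Megan Thee Stallion & Dua Lipa')))
```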

Immediately after loading a song, we make sure it actually is a song. To do this, we filter out songs with specific genres/tags, as Genius also houses texts which are not song lyrics. We therefore used the following list of bad genres to avoid those: ['track\\s?list', 'album art(work)?', 'liner notes', 'booklet', 'credits', 'interview', 'skit', 'instrumental', 'setlist', 'non-music', 'literature'].
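A filter based on this list could look like the following sketch (the helper name `is_song` is ours):

```python
import re

BAD_GENRES = ['track\\s?list', 'album art(work)?', 'liner notes', 'booklet',
              'credits', 'interview', 'skit', 'instrumental', 'setlist',
              'non-music', 'literature']
BAD_RE = re.compile('|'.join(BAD_GENRES), re.I)

def is_song(tags):
    """Return False if any tag matches one of the non-song patterns."""
    return not any(BAD_RE.search(tag) for tag in tags)

print(is_song(['pop', 'rock']), is_song(['Tracklist']), is_song(['Liner Notes']))
```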

The last step before all the raw data was gathered was to separate the artists for each song. This was done using regex to find and split artist names at ',', 'and', 'featuring' and so on. As a result, the artists Megan Thee Stallion & Dua Lipa for the song Sweetest Pie were changed to [Megan Thee Stallion, Dua Lipa], and the artists Lil Durk Featuring Gunna for the song What Happened To Virgil were changed to [Lil Durk, Gunna]. A negative side effect of this processing, however, is that artists like Earth, Wind & Fire were changed to [Earth, Wind, Fire]. This was a necessary part of the preprocessing, and such artists were regrouped later in the data cleaning.
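A simplified version of this splitting step could look as follows; the regex is a sketch capturing the separators named above, not the exact pattern used in the project:

```python
import re

SPLIT_RE = re.compile(
    r"\s*(?:,|&|\band\b|\bfeaturing\b|\bfeature\b|\bfeat\.?|\bft\.?|\bwith\b)\s*",
    re.I,
)

def process_artist_names(segment):
    """Split a Billboard artist string into individual (lower-cased) artist names."""
    return [p.strip().lower() for p in SPLIT_RE.split(segment) if p.strip()]

print(process_artist_names('Lil Durk Featuring Gunna'))   # ['lil durk', 'gunna']
print(process_artist_names('Earth, Wind & Fire'))         # ['earth', 'wind', 'fire']
```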

Manual lookup of songs

This way, when collecting data for each song through the modified LyricsGenius API, we would retrieve five attributes: date of release, artists who collaborated on the song, lyrics, genres and the song title. The data looks as follows:

| released | artists | lyrics | genres | title |
|---|---|---|---|---|
| 1957 | [marty robbins] | El Paso Lyrics\nOut in the West Texas town of ... | [country] | El Paso |
| 1960-01-04 | [frankie avalon] | Why Lyrics I'll never let you go\nWhy? Because ... | [pop] | Why |
| 1959 | [johnny preston] | Running Bear LyricsOn the bank of the river\nS... | [pop] | Running Bear |
| 1960-01-04 | [freddy cannon] | Way Down Yonder in New Orleans LyricsWell, way ... | [pop] | Way Down Yonder in New Orleans |
| 1960-01-04 | [guy mitchell] | Heartaches by the Number Lyrics\nHeartaches by... | [country, cover] | Heartaches by the Number |

2. Basic stats

Data Cleaning

At this point we had all the raw data, but it was apparent that in spite of our efforts during the data gathering, a lot of cleaning still had to be done.

Unwanted characters and non-english songs

First of all, unwanted unicode characters like \u200b, \u200c and \u200e, which had slipped in when the data was loaded, were removed from artists, genres and lyrics. Next up, duplicates were removed, and songs not in English were removed by doing language detection with the Python module langdetect.
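The character-stripping step can be sketched as below (the language filtering itself used `langdetect.detect`, roughly `detect(lyrics) == 'en'`, which is not run here):

```python
# zero-width characters that slipped in during scraping
ZERO_WIDTH = ['\u200b', '\u200c', '\u200e']

def strip_unwanted(text):
    """Remove zero-width characters from a string."""
    for ch in ZERO_WIDTH:
        text = text.replace(ch, '')
    return text

print(strip_unwanted('Dua\u200b Lipa'))  # Dua Lipa
```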

As can be seen in the table above, each song's lyrics begin with the title of the song followed by 'Lyrics'. This was also removed, as it wasn't part of the actual lyrics, but rather an artifact from gathering the song info using the Genius API.
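Removing this artifact amounts to stripping a leading "Title Lyrics" pattern; a minimal sketch (the helper name is ours):

```python
import re

def strip_lyrics_header(title, lyrics):
    """Drop the leading '<Title> Lyrics' prefix that the Genius API prepends."""
    return re.sub(r'^\s*' + re.escape(title) + r'\s*Lyrics\s*', '', lyrics)

print(strip_lyrics_header('El Paso', 'El Paso Lyrics\nOut in the West Texas town'))
print(strip_lyrics_header('Running Bear', 'Running Bear LyricsOn the bank of the river'))
```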

Create a list of all unique genres

Check if a song is non-english or doesn't have lyrics

Counting the number of songs:

Removing long songs

Afterwards, we decided to remove all songs where the lyrics were longer than 10,000 characters. This was done because, in spite of all the aforementioned cleaning steps, e.g. entire book chapters by the French novelist Marcel Proust were still present in the dataset, labelled with the genre rap. The cut-off at 10,000 was chosen based on the fact that every longer song we investigated had clearly been loaded incorrectly. For reference, the 6-minute-long song Rap God by Eminem, where he flexes his ability to rap fast, contains 7,984 characters.

While doing a finer combing of the data, we also produced a blacklist for artists that we deemed unwanted in the data set. This list includes Glee Cast as they were present in over 200 songs, even though their songs are covers of other popular songs. The full list is seen here ['highest to lowest', 'marcel proust', 'watsky', 'glee cast', 'harttsick', 'eric the red', 'fabvl', 'c-mob', 'hampered'].

Regrouping artists

As mentioned earlier, after gathering the data, we had to separate all artists to work with them properly, though in some cases this splits one artist into multiple, as was the case with Earth, Wind & Fire. To mitigate this problem, we first calculated how many times each artist appeared in the data set and, afterwards, for each artist, how many times they appeared with collaborating artists. Knowing these values, we could then check, for each artist, which other artists they have collaborated with on all of their songs. Artists found using this method were then joined with an underscore, such that ['earth', 'wind', 'fire'] became ['earth_fire_wind'].
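The regrouping logic can be sketched as follows on toy data. Artists are merged when every one of their appearances is with the same partners; the function name and implementation details are ours:

```python
from collections import Counter, defaultdict
from itertools import combinations

def regroup_artists(songs):
    """Merge artists that appear together on *all* of their songs into one node,
    e.g. ['earth', 'wind', 'fire'] -> ['earth_fire_wind']."""
    count = Counter(a for song in songs for a in song)
    pair = Counter()
    for song in songs:
        for a, b in combinations(sorted(set(song)), 2):
            pair[a, b] += 1
    # 'always together': every appearance of a is with b and vice versa
    together = defaultdict(set)
    for (a, b), c in pair.items():
        if c == count[a] == count[b]:
            together[a].add(b)
            together[b].add(a)
    def component(a):
        seen, stack = {a}, [a]
        while stack:
            for n in together[stack.pop()]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        return seen
    name = {a: '_'.join(sorted(component(a))) for a in count}
    return [sorted({name[a] for a in song}) for song in songs]

songs = [['earth', 'wind', 'fire'], ['earth', 'wind', 'fire'],
         ['drake', 'future'], ['drake']]
print(regroup_artists(songs))
```

Here earth, wind and fire always co-occur and are merged, while drake and future are kept apart because drake also appears alone.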

Preliminary look at the data

After doing all data processing and cleaning, the final data set is comprised of 25,419 songs and 7,855 unique artists. In the table below, the three data sets used throughout the project can be seen and downloaded.

| Data Set | Songs | Size (MB) |
|---|---|---|
| Billboard List | 29,128 | 1.6 |
| Pre-cleaned | 29,128 | 92.5 |
| Cleaned | 25,419 | 44.2 |

From this figure we can see that Drake has by far the most songs on the Billboard 'Hot-100' list. There is good diversity among the artists with the most songs on the list, but they mainly fall into the rap, R&B or pop genres.

Creating a list of all unique genres and plotting the amount of songs in each genre

It's clear to see that a majority of songs fall into the pop genre, with rock, R&B and rap taking 2nd to 4th place. This is not all that surprising, as these genres have been hugely popular since 1960. Rap, however, only rose to chart prominence in the 1990s, but has been a staple of the music industry since then.

And doing the same for decades:

A quick look at the distribution of songs through the decades shows that many old songs made it onto the list, with the 1960s having more songs than any other decade on the 'Hot-100' list. The 2010s saw a steep increase in the number of songs on the list compared to previous years. Perhaps there was a shift in what kind of music we were listening to.

Characteristics of the data

The data has now been gathered and thoroughly cleaned, but before we are ready to apply our network science and text analysis techniques to it, we will first look at the ten characteristics of Big Data:

Big

As mentioned previously, the data set comprises 25,419 songs and 7,855 unique artists; in addition, the lyric corpus has a total size of 8,476,446 tokens, of which 74,915 are unique. With this type of information, a data set of this size would be tough to come by other than by scraping the internet.

Always-on

Billboard updates their 'The Hot 100' chart each week, which means the list has been updated since we first collected the data. Because it updates weekly, the data set can be refreshed 52 times a year, which makes the data longitudinal; but since it only updates 52 times a year and not constantly like, e.g., Twitter, it is not entirely always-on.

Non-reactive

Reactivity describes whether subjects know researchers are observing them because that might change the subjects' behaviour. All musical artists are most likely aware that they are present on the chart and might follow their ranking closely, but the question is how much they change their behaviour and musical style to get a higher ranking on the chart. One could speculate that some artists might change their use of words and language to appeal to a broader audience to perform better on the chart, while others follow their musical heart. Though, with this being said, we do not believe that the fact that researchers might also be looking at the chart with the intent to do network science and text analysis will change the behaviour of the artists.

Incomplete

Completeness expresses whether the data set manages to capture the entire unfolding of a specific event or, e.g., the entire network of a specific group. In this project, we are attempting to analyse the network and text of the most popular artists and songs through modern times. With this in mind, we believe that Billboard's 'The Hot 100' chart gives a good indication of the most popular artists and songs, though one could argue that the chart might be skewed towards music popular in the United States.

Inaccessible

The data used in this project is very much accessible. As was accounted for earlier on this page, everything has been downloaded freely off the internet via different APIs.

Nonrepresentative

Representativity denotes whether the data can generalise to, e.g., social phenomena in general (out-of-sample generalisation). To this end, being a musician is quite a unique occupation when it comes to a social network of collaboration, in comparison to a profession like acting. One could presume the typical actor is more connected than the typical musician, since many actors are associated with a single movie or TV show, while a song is usually credited to only a few musicians. At least, few musicians are shown as the artists on a given song, even though many people might have worked on it during songwriting and production. Additionally, since our data set only contains English-language songs from a popular music chart in the West, the data might not be suited for generalising the network, or text, to musicians from other parts of the world. With this being said, the data set is probably still perfectly applicable for within-sample comparisons.

Drifting

There is some systematic drift in the data set, as the way songs are picked for the 'Hot-100' list has changed since its inception in 1958. Originally, songs were ranked purely on how well they sold, but as the music industry evolved and radio, TV and streaming became more prevalent, all these factors are now considered when songs are picked for the list.

Algorithmically confounded

As the songs are only picked from the Billboard 'Hot-100' list, there is some amount of algorithmic confounding going on.

Dirty

The data set could be dirty, as some songs could still be loaded wrongly, or we might have missed something during the cleaning. Furthermore, the data is not a complete overview of the connections between artists or the language they use, as we only include songs that appeared on the 'Hot-100' list.

Sensitive

The data is not sensitive, as it contains no information that isn't already public, and the data consists of very basic facts: release year, song title and song artists.

3. Tools, theory and analysis

Network

This section of the notebook will go through the network analysis of the data. We have used networkx to build the networks and netwulf to visualise them. In the following sections we will be investigating the full network of all musicians as well as subsets based on selected genres. The networks will be studied by calculating different statistics, such as the number of nodes, number of links, density, clustering and more. In addition, we will look at community detection to see how well the different genres manage to partition the networks into communities in comparison to the Louvain algorithm.

Network visualisation config.

Creating the full network

Calculate all genres associated to each artist as well as how many songs they have made for each genre.

Creating a list of 20 genres from which each artist can get their main genre label. In addition, a colour list to colour each node based on their main genre.

Calculate number of songs each artist has in the data set as well as how many times they have collaborated with other artists.

Add nodes

Add each artist as a node with three attributes

genre: most common genre for that artist within the fixed list 'genre_list'

size: number of times the artist has appeared on Billboard's the hot 100 (used to give each node the correct size)

all_genres: all genres associated with that artist

group: the colour of the genre associated with the artist

If an artist has multiple most common genres, meaning that they, e.g., have made 5 pop songs and 5 rock songs, the genre attribute for that artist will be picked at random amongst the most common genres. An exception to this is rap and trap: because trap is a subgenre of rap (but still a major, well-defined genre), we deem it more appropriate to label artists as trap if they have an equal number of rap and trap songs.

Add edges

Add edges between two artists if they have collaborated on a song and weigh the edge by the number of times they have collaborated.
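The node and edge construction can be sketched as follows with networkx on toy data; the attribute names `size` and `weight` follow the description above, while the rest (data, counts) is ours:

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# toy stand-in for the cleaned data: each song is a list of artists
songs = [['drake', 'future'], ['drake', 'future'], ['drake'], ['dua lipa']]

appearances = Counter(a for song in songs for a in song)
collab = Counter()
for song in songs:
    for a, b in combinations(sorted(set(song)), 2):
        collab[a, b] += 1

G = nx.Graph()
for artist, n_songs in appearances.items():
    G.add_node(artist, size=n_songs)   # node size = number of chart songs
for (a, b), w in collab.items():
    G.add_edge(a, b, weight=w)         # edge weight = number of collaborations

print(G.number_of_nodes(), G.number_of_edges(), G['drake']['future']['weight'])
```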

Helper functions

Deciding which genre networks to analyse

It was previously decided that each artist gets their main label based on genre_list. Though, analysing and visualising 20 different networks can get a bit cumbersome, so we will pick out a subset. To do this, we first find out how many artists have each genre as their main genre, and how many times each genre occurs in total.

The genres we have decided to pick are based on the number of times they occur as well as which genres we deem interesting. Based on the results seen above, the following 11 genre networks will be analysed:

pop, rap, rock, R&B, country, soul, ballad, hip-hop, trap, singer-songwriter and funk.

Analysis

The full network has now been created and we are ready to do visualisations and analysis. In the following sections we will be working with the full network and the sub-networks described above. For each of the networks we will be investigating the full network as well as versions where singleton nodes with fewer than 5 songs are removed.

The reasoning for only removing singleton nodes with fewer than 5 songs is that we want to make the networks as clear as possible, while still keeping the singleton artists that are influential for the genre at hand.

With singletons

From these basic statistics we see that the number of nodes in the networks is 7854 and the number of links is 6799.

The density of an undirected graph is given by:

\begin{align} d=\frac{2m}{n(n-1)}, \end{align}

where $m$ is the number of edges and $n$ is the number of nodes. The interpretation of the measure is that the density is 0 for a graph without edges and 1 for a completely connected graph; it is therefore a measure of how dense a graph is with respect to edge connectivity. In this case, the network has a density of 0.00022. This can be a little hard to interpret, which is why we have also calculated the average clustering coefficient, given by:

\begin{align} \overline{C}=\frac{1}{N} \sum_{i=1}^N \frac{2L_i}{k_i(k_i-1)}, \end{align}

where $L_i$ is the number of links between the $k_i$ neighbours of node $i$. The interpretation of this measure is the probability that two neighbours of a randomly selected node link to each other. For this network, we have an average clustering coefficient of 0.16.

Lastly, we see that the average degree of the nodes in the graph is 1.73, which means that a node on average is connected to 1.73 other nodes. We also see that the minimum, median and mode of the degrees is 0, whereas the maximum degree is 108.
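The reported density and average degree can be checked directly from the formula and the stated node and link counts:

```python
# node and link counts reported above
n, m = 7854, 6799

density = 2 * m / (n * (n - 1))  # d = 2m / (n(n-1))
avg_degree = 2 * m / n           # each edge contributes to two nodes' degrees

print(round(density, 5))     # 0.00022
print(round(avg_degree, 2))  # 1.73
```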

Analysis of degrees

We will now analyse the degrees of the network a bit more thoroughly by looking at the distribution of degrees on a log-log scale. The reasoning for this is that a common feature of real-world networks is hubs, meaning that a few nodes in a network are highly connected to other nodes. Scale-free networks are networks with such large hubs, and they are characterised by a power-law degree distribution.

Looking at the figure above, we see exactly that: the degree distribution of the network follows a power law, which gives a good indication that we are dealing with a real-world network rather than a random network.

Community detection

In this section we will explore the communities of the network. To do this, we look at the partition obtained when grouping the artists by their genre. This will be compared to the partition obtained using the Louvain algorithm. To get an indication of whether the two partitions are good at dividing the network into modules, both partitions will then be compared to random networks based on the real network. When doing this comparison, we can see whether the modularity of the two partitions is significantly different from 0.

First off, we will be getting the partitions based on the genres

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

We initially see that the modularity obtained by using the Louvain algorithm is more than twice as large as when using the genres.

Building random networks for comparison

Next up, we will be generating 1,000 random networks using the double edge swap algorithm. This ensures that each node in a random network has the same degree as in the original network, but the connections are different. For each of these random networks, we partition them using the genres and calculate their modularities. We perform 1.2 × (number of edges) swaps to make sure we get a fully randomised version of the network.
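A minimal sketch of the rewiring step with networkx, using the built-in karate club graph as a small stand-in for the artist network:

```python
import networkx as nx

G = nx.karate_club_graph()            # small stand-in for the artist network
R = G.copy()
nswap = int(1.2 * R.number_of_edges())
nx.double_edge_swap(R, nswap=nswap, max_tries=nswap * 100, seed=42)

# the rewired network keeps every node's degree but randomises the connections
print(sorted(d for _, d in G.degree()) == sorted(d for _, d in R.degree()))  # True
```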

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected, as the networks are random; we therefore shouldn't find any good partition using the genres.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity along side the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both partitioning methods lead to a modularity significantly different from 0, and thereby also larger than any of those from the random networks. Through the modularity measure, we can thus deem that the network is not random. Though, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is more than twice that of the genre partition. To get an understanding of how this partition looks, we will visualise the graph with the Louvain partitioning.

Noticeable here is that the Louvain algorithm actually also groups many of the rap, pop, rock and country artists together into four separate groups, though many more groups appear overall. Let's see just how many:

We here see that the Louvain algorithm partitions the network into an immense 4,994 groups, which is enormous relative to the 7,854 nodes in the graph. An explanation for this is that each of the many singleton nodes is probably given its own group, which gives a high-modularity partition, but doesn't make much sense compared to a partitioning using genres.

Betweenness centrality

As mentioned previously, we have decided to weigh the nodes in the network by the number of songs that artist has in the data set. The advantage of this is that the most popular artists are the ones that are easiest to see; this is especially the case for older artists that haven't collaborated as much, such as Elvis Presley or The Beatles. Artists like these would be virtually invisible if we had weighted the nodes by the strength of their connections. On the other hand, weighing nodes by the strength of their connections tells a great deal about which nodes are the biggest collaborators, and thereby some of the most central nodes in the graph.

We will therefore in this section deal with betweenness centrality, which, for each node in a graph, is a measure of how central that node is. The measure is based on shortest paths, in such a way that the betweenness centrality of a node is the fraction of shortest paths that pass through it. The formula for betweenness centrality is given by:

\begin{align} BC(n)=\sum_{s\neq n \neq t} \frac{\sigma_{s,t}(n)}{\sigma_{s,t}}, \end{align}

where $\sigma_{s,t}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{s,t}(n)$ is the number of those paths that pass through $n$.
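networkx computes this directly. A tiny illustration on a three-node path, where every shortest path between the two ends passes through the middle node:

```python
import networkx as nx

# path a-b-c: the only shortest path between a and c runs through b
G = nx.path_graph(['a', 'b', 'c'])
bc = nx.betweenness_centrality(G)  # normalised by default

print(bc['b'])  # 1.0
print(bc['a'])  # 0.0
```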

Combining this with weighing the artists by the number of songs they have in the data set will give us a great overview of not just the most popular artists, but also the most central, collaboratory and connective artists.

Having calculated the betweenness centrality for each node, we see that many rappers are present in the top-20. This is not too surprising given the number of rap artists, their tendency to collaborate and the graph we were looking at earlier. Though we also see names like Quincy Jones, James Ingram and Stevie Wonder; it is interesting to see those artists playing a central part in the network.

Without singletons

The next part of the analysis of the full network is the version where we remove singleton nodes with fewer than 5 songs. The following section goes through the same steps as for the complete network, so not everything will be described in the same level of detail.

Properties

Calculate basic statistics for the network

Compared to the full network, we have now gone down from 7854 to 4154 nodes while keeping the same number of edges. As expected, all the other network properties have gone up; with a larger density, average clustering and average degree, we should now see a network that is more densely connected.

Analysis of degrees

Looking at the figure above, we again see that the degree distribution follows a power law.

Community detection

We will again detect communities of the network using both the genres and the Louvain algorithm, both of which will be compared to random networks.

First off, we will be getting the partitions based on the genres.

We here see a modularity that is exactly the same as before. The formula for the modularity is given by (cf. eq. 9.12 of the NS book):

\begin{align} M= \sum_{c=1}^{n_c}\left[ \frac{L_c}{L}-\left(\frac{k_c}{2L} \right)^2 \right] \end{align}

where $n_c$ is the number of communities, $L_c$ is the number of links in community $c$, $L$ is the total number of links in the network and $k_c$ is the total degree of community $c$. This means that the modularity doesn't depend on the number of nodes at all, and since isolated nodes are the only things removed from the full network, the modularity doesn't change.
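The formula can be verified by hand on a toy graph, here two triangles joined by a single edge (our own example, not from the data):

```python
# two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = [{0, 1, 2}, {3, 4, 5}]

L = len(edges)
M = 0.0
for c in communities:
    L_c = sum(1 for a, b in edges if a in c and b in c)       # links inside c
    k_c = sum(1 for a, b in edges for v in (a, b) if v in c)  # total degree of c
    M += L_c / L - (k_c / (2 * L)) ** 2

print(round(M, 4))  # 0.3571
```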

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

We initially see that the modularity obtained using the Louvain algorithm is almost the same as for the full network (0.7440). The small difference is due to the Louvain algorithm not being exact and being non-deterministic. So, as for the full graph, the modularity of the Louvain partition is more than twice that of the genre partition.

Building random networks for comparison

Next up, we will be generating 1,000 random networks using the double edge swap algorithm. For each of these random networks, we partition them using the genres and calculate their modularities.

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected, as the networks are random; we therefore shouldn't find any good partition using the genres.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity along side the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both partitioning methods lead to a modularity significantly different from 0, and thereby also larger than any of those from the random networks. Though, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is more than twice that of the genre partition. To get an understanding of how this partition looks, we will visualise the graph with the Louvain partitioning.

As with the previous Louvain graph, the algorithm manages to group the main clumps of nodes together quite well. Noticeable, though, is that the rappers are divided into two groups (light green and black).

Let's see how many groups we have in this partitioning:

We here see that the Louvain algorithm partitions the network into 1,293 groups, far fewer than the 4,994 of the last Louvain network. This means the number of communities is reduced by 4994 - 1293 = 3701, and having lost 7854 - 4154 = 3700 nodes when removing singletons, this confirms that the Louvain algorithm gives all singleton nodes their own community.

Having now examined the full network for all genres for the musical artists, we will be moving on to analysing some of the most popular genres that we think are interesting.

Pop network

We are here looking at the network of artists who have at least one song with the tag pop in the data set. The size of each node is determined by the number of songs they have with the tag pop.

With singletons

Properties

Calculate basic statistics for the network

In comparison to the full network, the pop network has approximately 3000 fewer nodes and 2900 fewer links, but the density, average clustering and average degree haven't changed all that much.

Community detection

In this section we will explore the communities of the pop network. We will go through the same steps as previously. First off, we will be getting the partitions based on the genres

We here see a modularity which is lower than what it was for the full network.

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

The Louvain partition modularity is seen to be considerably larger than the genre modularity.

Building random networks for comparison

Next up, we will be generating 1,000 random networks using the double edge swap algorithm. For each of these random networks, we partition them using the genres and calculate their modularities.

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected, as the networks are random; we therefore shouldn't find any good partition using the genres.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity along side the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both partitioning methods lead to a modularity significantly different from 0, and thereby also larger than any of those from the random networks. Though, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is much larger than that of the genre partition. To get an understanding of how this partition looks, we will visualise the graph with the Louvain partitioning.

Noticeable here is that the Louvain algorithm manages to divide the pop artists into communities that make decent sense. E.g., some of the rappers and R&B artists are grouped as red nodes, whereas female artists like Taylor Swift are seen in very light green and other artists like Beyoncé and Rihanna in light green. Very interesting.

Let's see how many communities we have in total:

We here see that the Louvain algorithm partitions the network into 3328 groups, which is quite a lot compared to the 4802 nodes in the graph. Again, the large number of singleton nodes is likely the explanation.

Without singletons

This brings us to the next analysis of the pop network: the version where we remove singleton nodes with fewer than 5 songs. The following section goes through the same steps as previously.

Properties

Calculate basic statistics for the network

Compared to the version with singletons, we have now gone down from 4802 to 2218 nodes while keeping the same number of edges. As expected, all the other network properties have gone up; with a larger density, average clustering and average degree, we should now see a network that is more densely connected.

Community detection

We will again detect communities of the network using both the genres and the Louvain algorithm, both of which will be compared to random networks.

First off, we will be getting the partitions based on the genres.

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

We initially see that the modularity obtained using the Louvain algorithm is almost the same as for the version with singletons (0.7053). The small difference is due to the Louvain algorithm not being exact and being non-deterministic. So, as before, the modularity of the Louvain partition is more than twice that of the genre partition.

Building random networks for comparison

Next up, we will be generating 1,000 random networks using the double edge swap algorithm. For each of these random networks, we partition them using the genres and calculate their modularities.

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected, as the networks are random; we therefore shouldn't find any good partition using the genres.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity alongside the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both of the partitioning methods lead to a modularity significantly different from 0, and thereby also larger than any of those from the random networks. Though, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is much larger than for the genre partition. To get an understanding of how this partition looks, we will be visualising the graph with the Louvain partitioning.

As with the previous Louvain graph, the algorithm manages to group the main clumps of nodes together quite well.

Let's see how many groups we have in this partitioning:

We here see that the Louvain algorithm partitions the network into 740 groups, which is a lot less than the 3328 of the last Louvain network. The number of communities is thus reduced by 3328 - 740 = 2588, and having lost 4802 - 2218 = 2584 nodes when removing singletons, we again see that the Louvain algorithm gives all singleton nodes their own community.

Retrieving statistics and visualisations for the remaining genres

For the remaining genres (rap, rock, R&B, country, soul, ballad, hip-hop, trap, singer-songwriter and funk), we will be gathering statistics and making visualisations of the networks with and without singletons, with the genre community partition and the Louvain community partition, as this information will be used on the website. These results will not be shown here in the notebook, though, as they would simply take up way too much space.

The following function takes a genre and a graph, then computes and saves statistics and a network graph visualisation for both the genre partition and the Louvain partition, for the graph with and without singletons.

Text Analysis

This part of the notebook contains different analyses of the song lyrics. The main methods are TF-IDF scores (used to create wordclouds), sentiment analysis, dispersion plots and, lastly, LSA, which is used to calculate similarities between artists. Most of these methods will be applied in multiple scenarios. In general, the songs will be analysed with respect to the decade in which they were released and the genre to which they belong.

Preprocessing lyrics

Prior to conducting any analysis, the lyrics are preprocessed in order to prepare the data. All lyrics are tokenized and lemmatized using nltk and all tokens containing a non-alphabetic character are removed. All characters are made lowercase and for every song each word is only counted once. This is done since it is typical for songs to contain a lot of repetition (as it makes the lyrics easier to remember).
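The notebook uses nltk for tokenisation and lemmatisation; the lowercasing, alphabetic filtering and per-song deduplication can be sketched with a plain regex tokeniser standing in for nltk (the sample lyric is just an illustration):

```python
import re

def preprocess(lyrics):
    """Lowercase, keep purely alphabetic tokens, count each word once per song."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    tokens = [t for t in tokens if t.isalpha()]  # drop tokens with non-letters
    return sorted(set(tokens))                   # each word counted only once

print(preprocess("Love, love me do / You know I love you"))
```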

Fraction of genres per decade

Since the data stems from the Billboard hot 100 chart it is possible to show how dominant some of the genres have been through time. The figure below shows how much of the music on the chart was labelled as the given genre in each decade. Note that most songs have plenty of genre tags, so the ratios do not sum to 1 (also only the most popular genres are shown).

This graph and the table above illustrate a clear trend. Pop dominated for a long time, but since 2010 rap has taken the throne. Nowadays even a "subgenre" of rap, namely trap, has become more popular than pop. Another interesting fact is that rock has almost completely vanished from the charts in the last decade, whereas folk has remained fairly consistent throughout time. The graph also illustrates rap starting to gain traction in the US around the eighties.
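The ratios behind the figure come from counting, per decade, how many songs carry each tag and dividing by the decade's song total; a sketch on made-up data (since a song can carry several tags, the ratios need not sum to 1):

```python
from collections import defaultdict

# Made-up (decade, genre tags) pairs standing in for the Billboard data
songs = [
    (1960, {"pop", "rock"}), (1960, {"pop"}),
    (2010, {"rap", "trap"}), (2010, {"pop"}),
]

counts = defaultdict(lambda: defaultdict(int))  # decade -> genre -> song count
totals = defaultdict(int)                       # decade -> total songs
for decade, tags in songs:
    totals[decade] += 1
    for genre in tags:
        counts[decade][genre] += 1

ratios = {d: {g: c / totals[d] for g, c in gs.items()} for d, gs in counts.items()}
print(ratios[1960]["pop"], ratios[2010]["rap"])
```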

TF-IDF & Wordcloud

The TF-IDF (term frequency, inverse document frequency) score is a measure of how much a term relates to the characteristics of a document. In this study, terms are of course words in the lyrics of songs, and documents can be either decades, genres or artists, depending on the scenario we are interested in analysing. The TF is simply how many times a given term occurs in the document, and the IDF is a measure of how unique the term is, given by:

\begin{equation} \text{idf}(t, D) = \log\left(\frac{N}{|\{d\in D : t\in d\}|}\right) \end{equation}

where $t$ is a term, $D$ is the set of documents (denoted the corpus) and $N$ is the number of documents. The TF-IDF is the product of TF and IDF, meaning that a term is most important if it occurs frequently in the given document while appearing in few other documents.
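The definition above can be sketched directly in Python; the two tiny "documents" here are made up for illustration:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: {name: token list}. Returns {name: {term: tf-idf score}}."""
    N = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log(N / df[t]) for t in df}
    return {name: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for name, tokens in docs.items()}

docs = {"country": ["truck", "road", "love"],
        "rap":     ["money", "love"]}
scores = tf_idf(docs)

# "love" appears in both documents, so its idf (and hence tf-idf) is log(2/2) = 0
print(scores["country"]["truck"], scores["country"]["love"])
```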

Genres

The data contains 582 genres. Many of these are sub-genres of the main genres which we all know and love. Importantly, many songs are tagged with several different genres. This is handled by assigning the song to all genres with which it is tagged. This creates some overlap between the genres, but this is only an issue for subgenres. Using all genres is thus not desirable, since it is not relevant how pop relates to dance-pop or alternative-pop, but it is relevant how pop relates to rap and rock. Therefore, the genres which constitute the corpus were hand-selected from the genres which appear the most from 1960-2022.

NOTE: The output of the next section has been limited in order to not clutter the notebook too much. If you want to see the full output, you can view it under the Genres part of the Text Analysis section on the webpage.

As is evident by the output above, the TF-IDF scores succeed in highlighting a lot of the characteristics of the different genres.

Wordclouds are useful for illustrating the important terms, since the importance corresponds to the fontsize of the term. This makes for a nice visual representation which grants a much clearer overview of the similarities and differences between documents (in this case genres).

As a small note, the wordclouds are displayed with masks of well-known musicians from the given genre. The original images are transparently overlayed to aid the image clarity. These images are used on the website, and to avoid any ugly background, a background-removing helper function is implemented.

The masks have been chosen somewhat arbitrarily, but hopefully some of the artists are recognisable. Looking at the wordcloud for country, an extremely clear tendency is evident: all terms with a significant TF-IDF score describe everyday activities relevant to farmers in the US and the like. The UK wordcloud contains a lot of British slang such as mum, paigons, blud and ting, and the rap wordcloud is full of the harsh language the genre is known for today.

Decade

The same procedure is then done while instead dividing the songs according to the decade in which they were released.

NOTE: The output of the next section has been limited in order to not clutter the notebook too much. If you want to see the full output, you can view it under the Decades part of the Text Analysis section on the webpage.

In the 60's, 70's and 80's, most words are completely normal words which everyone might use in their everyday life. Some are perhaps more expressive than ordinary speech, but still real words. Some quite romantic words like tenderly are also used. In the 60's, the word watusi was used a lot, as it is the name of a dance popular at the time. In the 70's, doggone was used a lot; it has in more recent times been completely replaced with the term damn. In the 70's, the term nigger also has a high TF-IDF score, which is surprising, but the reason is that 5 different songs mention the word in the 70's and it is never mentioned in another decade. In most of these songs it is used to provoke.

The 90's almost seem like a transitional time from the old school to the new school of mainstream music; that is when rap entered the music scene for good. In the 00's, mostly slang words fill the wordcloud. These slang words are mainly attributed to rap/hip-hop artists; some examples are shawty and swag. Some of the most influential artists and producers also appear, such as Ludacris and Darkchild.

Lastly, in the 10's and 20's the wordclouds are filled with ad-libs such as skrrt, brrt, ayy and baow, and modern slang/shorthands like opp meaning opponent, and hunnid meaning hundred.

Artists

Since there are 7855 artists in the dataset, the artists considered in the corpus will be those which have managed to appear on the Hot 100 chart at least 10 times. This is done to achieve documents which actually can have different term frequencies for each term, and also to show how the well-known artists differ from each other in their use of words. Identically to the genres, some songs are shared by multiple artists (thank god for that, otherwise there would be no network). This is handled in the same way, meaning that if two artists collaborated on a song, then both are assigned all the words in the song. This seems fair, since putting one's name on a song automatically means being associated with the whole song.

This is still quite a lot of musicians, so some of the most well-known artists have been selected for investigation. In total there are 41 selected artists for whom a picture is available, making the wordclouds nicer to look at! These artists are:

NOTE: The output of the next section has been limited in order to not clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

These wordclouds tell much the same story as those of the genres and the decades. It is clear that musicians from the sixties and seventies (although also regarded as pop artists) use a vastly different language compared to the musicians who thrive in today's mainstream music scene. One example is Frank Sinatra, who uses a lot of long and very expressive words such as inconceivable or reminding. Another word which shows signs of the time when Frank Sinatra published his music is musical, which certainly was more popular back in the day.

The mainstream rappers such as Juice Wrld use a lot of swearwords and ad-libs. Juice Wrld died of an overdose at a very young age, and it is no secret that he was an addict, so it makes sense that his wordcloud is overrun with drug references.

Another good comparison is the fact that Elvis uses the word darling a lot, whereas popular pop and rap artists nowadays use the words bitch and hoe A LOT more. It is also clear that the audience has changed a lot through the years.

Dispersion plot

Dispersion plots are interesting as they can give an indication of when certain words were used in music throughout time. As the data table is sorted by release date, it is simple to create a dispersion plot of all the songs. A small modification to the nltk dispersion_plot function had to be implemented to allow the xticks to be the decades. The function for plotting dispersion plots with custom xticks is shown below with the accompanying dispersion plot of certain handpicked words which illustrate a shift in the language of the mainstream music scene.
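Underneath the plotting, a dispersion plot only needs, for each target word, the positions (here: chronological song indices) where the word occurs; a sketch with a made-up song list (the custom-xtick plotting itself is omitted):

```python
def dispersion_offsets(tokens_by_song, targets):
    """For each target word, the indices of the songs that contain it.

    tokens_by_song is assumed sorted by release date, so the indices map
    onto a time axis (the decades shown as custom xticks on the plot).
    """
    return {w: [i for i, tokens in enumerate(tokens_by_song) if w in tokens]
            for w in targets}

songs = [{"darling", "love"}, {"love", "baby"}, {"swag", "shawty"}]
print(dispersion_offsets(songs, ["darling", "swag"]))
```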

One can spend an endless amount of time coming up with interesting terms which define certain periods, so the dispersion plot above is far from an exhaustive account of the trends which came and went throughout the last six decades. However, it does tell an interesting story, and it illustrates the beginnings and ends of eras.

For example, it seems almost as if the sweet word darling was phased out during the nineties and replaced with the more degrading word bitch. boogie and funky also illustrate the rise and fall of funk music; from the plot, it almost seems to have died out a bit in the late eighties and then come back in the nineties.

As rap hit the mainstream in the early nineties, the word nigga became a fixed part of the rap songs made by black rappers. The words swag and shorty followed around 2000-2010 but have become less used in recent times.

The word watusi is included as it is the name of a specific dance which was popular in the sixties. That is also easy to see in the dispersion plot, as it is almost never used after 1970.

Sentiment analysis

Next, the sentiment of the genres, decades and artists is investigated. Here the labMT Hedonometer data from class is used as a lookup table for the sentiment of terms. The sentiment score ranges from 0-10, where 0 is extremely negative and 10 is extremely positive. In order to allow for fast lookup, the words are stored in a dictionary with their corresponding sentiment scores. Lastly, the sentiment of a document is computed as a weighted average of the sentiment of all words in the given document which have a sentiment score in the Hedonometer dataframe. All other words are removed so that they do not count towards the average sentiment score; otherwise they would count as 0, i.e. the most negative word one could imagine. Another option is to set those words to have sentiment 5 (the middle of the scale), but that may create a bias, since the actual average of the sentiment scores in the Hedonometer data is not 5.
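The weighted average described above can be sketched as follows; the two-word `labmt` dictionary is a made-up stand-in for the real labMT lookup table:

```python
def avg_sentiment(word_counts, labmt):
    """Weighted average sentiment over the words that have a labMT score.

    Words missing from labmt are skipped entirely, so they neither drag
    the average towards 0 nor towards the neutral midpoint.
    """
    total = weight = 0.0
    for word, count in word_counts.items():
        if word in labmt:
            total += labmt[word] * count
            weight += count
    return total / weight if weight else None

labmt = {"love": 8.42, "hate": 2.34}  # tiny stand-in for the real data
print(avg_sentiment({"love": 3, "hate": 1, "skrrt": 5}, labmt))
```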

Genre

Once again, the focus is on the genres previously defined as being the most popular through time.

The results of the sentiment analysis are not very surprising. Most genres have about equal sentiment, but rap and trap do have the lowest sentiment scores, albeit still above the average sentiment of all the words in the Hedonometer data. Among the happiest genres are jazz, soul, funk and country, closely followed by pop.

Decade

The same procedure is carried out now focusing on the decades; however, the sentiment for each month is also calculated, along with a rolling 1-year average, to illustrate the finer nuances of the trend in sentiment.

The plot displays what has already been established: it seems that lyrics have become less happy through time, especially in recent years. Of course, this can also be linked to the rise of the angry genres such as rap and its offspring trap. An example was seen in the dispersion plot, where darling was used until the nineties, when bitch replaced it.

Artist

NOTE: The output of the next section has been limited in order to not clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

The distribution in light blue is over all 7855 artists. The green distribution is only over the 735 top artists. The plots show the tendency that the old pop artists such as The Beatles and Frank Sinatra have happier lyrics, whereas rappers fall within the left part of the distribution with the lowest average sentiment. In the middle we see a lot of popular pop artists from the last two decades.

LSA

Latent semantic analysis is a method for processing text where the relationship between documents and terms is analysed. In particular, it will here be used to compute similarity scores between artists. The aim is to uncover which artists are most alike, but also which artists are the least alike. Perhaps it will indicate artists who have used the same ghost-writers. Since songs with collaborations are assigned to all collaborating artists, collaborators will be a lot more likely to be similar. That does not, however, mean that the result will not be interesting. Also, as mentioned before, one should think twice about putting their name on a song with lyrics that do not fit their agenda. Cosine similarity is used, since all artists are mapped into a D-dimensional space where D corresponds to the total number of words in the vocabulary. In this case D=50697, which is a lot!
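The cosine similarity itself can be sketched on sparse word-count vectors; in the notebook the vectors live in the LSA-reduced space rather than over raw counts, and the two artist vocabularies below are made up:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

a = {"love": 2, "baby": 1}   # hypothetical artist vocabularies
b = {"love": 1, "money": 3}
print(round(cosine(a, b), 3))
```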

To illustrate what can be done with this technique, the five artists most and least similar to Justin Bieber are shown above. The most similar artists are pop artists. Chris Brown and Drake belong to R&B and rap respectively; however, it can certainly be argued that they are quite "poppy". It should also be noted that Taylor Swift and Justin Bieber have not collaborated on a song, so the bias is not completely ruining the similarity scores. Looking at the least similar artists, it is a mix of different genres. K.A.A.N. is a rapper and Kali Uchis is a quite modern R&B artist.

NOTE: The output of the next section has been limited in order to not clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

4. Discussion

Overall, we are quite satisfied with the results from the project. We have been able to find interesting attributes for collaborations of artists via our network analysis, and our text analysis shows how the language of the songs we listen to has changed throughout the years, but also from artist to artist and genre to genre.

The custom styling we created for the website played a huge role in being able to display the network and text analysis parts without overwhelming the reader with a mile-long page. Had time permitted, we would have liked to delve even deeper into the website, adding small features and making the layout even better.

Using the network theory from the course we have been able to create thorough analyses of the different networks for each genre. Furthermore, we expanded on the course material by calculating the betweenness centrality of the networks, in order to see which artists were more collaborative than others.

Unfortunately, an early look into the lexical diversity of the lyrics did not show much, and thus it was not prioritised as highly as the other aspects of the text analysis. Given more time, it would be interesting to look into this more thoroughly.